Recently, there has been increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. We propose a smoothing method for the attention mechanism and compare it with content attention and convolutional attention. Moreover, frame skipping is employed for fast training and convergence. On the XiaoMi TV voice search dataset, we achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% without using any lexicon or language model. Combined with a trigram language model, we reach 2.81% CER and 5.77% SER.
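
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how CER and SER are commonly computed. It assumes the standard definitions (CER as total character-level edit distance divided by total reference characters, SER as the fraction of utterances with at least one error). The function names and the toy utterances are hypothetical and are not taken from the paper or its dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) over paired reference and hypothesis strings."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    wrong_sentences = sum(1 for r, h in zip(refs, hyps) if r != h)
    return total_edits / total_chars, wrong_sentences / len(refs)


# Toy utterances invented for illustration only.
refs = ["打开电视", "播放电影"]
hyps = ["打开电视", "播放电音"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER = {cer:.2%}, SER = {ser:.2%}")
```

Because Mandarin text is scored per character, no word segmentation or lexicon is needed to compute CER, which fits the lexicon-free setting described above.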